Abstract



Maximum likelihood is one of the most widely used techniques to infer evolutionary histories. 
Although it is thought to be intractable, a proof of its hardness has been lacking. Here, we give 
a short proof that computing the maximum likelihood tree is NP-hard by exploiting a connection 
between likelihood and parsimony observed by Tuffley and Steel. 

1 Introduction 

In a series of seminal works, Edwards and Cavalli-Sforza, Neyman, and Felsenstein applied the
maximum likelihood methodology to the problem of inferring phylogenies from molecular sequences.
Since then, the many variants of this approach have gained increasing popularity in the systematics
literature. This is due in part to the flexibility of the technique in accommodating a variety of
models of evolution, as well as to good practical performance. Nevertheless, the approach is not without
a flaw: it has been observed to be highly demanding computationally. Remarkably, the computational
complexity status of the problem has remained elusive. Partial progress was made recently when
a variant of the problem, known as Ancestral Maximum Likelihood, was shown to be NP-hard. Here,
we resolve the issue by proving that computing the maximum likelihood tree is NP-hard. Moreover,
we show that the log-likelihood is NP-hard to approximate within a constant ratio. Our proof, which
is mostly elementary, combines a connection between likelihood and parsimony observed by Tuffley
and Steel [17] with a result on the hardness of approximating parsimony obtained by Wareham [18].

General references on inferring phylogenies are [12, 16]. For background on NP-completeness and
hardness of approximation, refer to [14]. Many other popular phylogenetic techniques have been
shown to be NP-hard, including parsimony [13, 17], compatibility [8], and distance-based methods [2].

Remark. While writing this paper, Mike Steel brought to our attention that Benny Chor and
Tamir Tuller have recently given an independent proof of this result, which now appears in the
proceedings of RECOMB 2005 [3]. Similarly to the proof presented here, the Chor-Tuller paper uses
results from [17] and the hardness of approximating vertex cover (from which follows the hardness of
approximating parsimony), but their argument proceeds from a sequence of rather involved
constructions. Our reduction has the advantage of being short and elementary. It also sheds some more
light on the interesting connection between likelihood and parsimony.



2 Definitions and Results 

The use of maximum likelihood requires the choice of a statistical model of evolution. Here, we consider
the simple binary symmetric model generally known as the Cavender-Farris model [1JEI]. We are given
a tree $T$ on $n$ leaves and probabilities of transition on edges $p = \{p_e\}_{e \in E(T)} \in [0,1/2]^{E_T}$,
where $E(T)$ is the set of edges of $T$ and $E_T = |E(T)|$ is the cardinality of $E(T)$. (All trees
considered here have no internal vertex of degree 2.) A realization of the model is obtained as follows:
choose any vertex as a root; pick a state for the root uniformly at random in $\{0,1\}$; moving away from
the root, each edge $e$ flips the state of its ancestor with probability $p_e$. Let $[n]$ denote the set of
leaves. A character $\chi$ assigns to each leaf a state in $\{0,1\}$. An extension of $\chi$ is an assignment
$\hat\chi$ of states in $\{0,1\}$ to all vertices of $T$ which is equal to $\chi$ on the leaves. The set of all
extensions of $\chi$ is denoted $H(\chi)$.
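In the Cavender-Farris model, the (modified) log-likelihood of $\chi$ on $(T, p)$ is

$$\mathcal{L}(\chi; T, p) = -\ln\left(2\,\mathbb{P}[\chi \mid T, p]\right) = -\ln\left( \sum_{\hat\chi \in H(\chi)} \prod_{e=(u,v) \in E(T)} p_e^{\mathbf{1}\{\hat\chi(u) \neq \hat\chi(v)\}} (1 - p_e)^{\mathbf{1}\{\hat\chi(u) = \hat\chi(v)\}} \right), \qquad (1)$$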

where $\mathbf{1}\{A\}$ is 1 if $A$ occurs, and 0 otherwise. The data consists of a set of characters
$X = \{\chi_i\}_{i=1}^{k}$. Assuming the characters are independent and identically distributed, the
log-likelihood of the data is the sum of the log-likelihoods of all characters, viz.
$\mathcal{L}(X; T, p) = \sum_{i=1}^{k} \mathcal{L}(\chi_i; T, p)$. The maximum likelihood (ML) problem
consists in computing $(T^*, p^*)$ minimizing $\mathcal{L}(X; T, p)$ over all trees and transition
probability vectors. Note that this is equivalent to maximizing the standard likelihood, without
the factor of 2.
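
As an illustration of the definition of the modified log-likelihood, the following minimal Python sketch evaluates it by brute force, summing over every extension of a character. The encoding of the tree as a list of vertex pairs, the function name and the quartet example are assumptions made for this sketch only; the enumeration is exponential in the number of internal vertices and is meant for tiny trees.

```python
import itertools
import math


def modified_log_likelihood(edges, leaves, chi, p):
    """Return -ln(2 * P[chi | T, p]) by brute-force summation over extensions.

    edges  : list of (u, v) pairs, the edges of the tree T (hypothetical encoding)
    leaves : collection of leaf labels
    chi    : dict mapping each leaf to a state in {0, 1}
    p      : dict mapping each edge (u, v) to its flip probability in [0, 1/2]
    """
    vertices = sorted({v for edge in edges for v in edge})
    internal = [v for v in vertices if v not in leaves]
    total = 0.0
    # Each extension keeps the leaf states of chi and tries one internal labelling.
    for states in itertools.product((0, 1), repeat=len(internal)):
        ext = dict(chi)
        ext.update(zip(internal, states))
        weight = 1.0
        for (u, v) in edges:
            # An edge contributes p_e if its endpoints differ, 1 - p_e otherwise.
            weight *= p[(u, v)] if ext[u] != ext[v] else 1.0 - p[(u, v)]
        total += weight
    return -math.log(total)


# Quartet tree ab|cd with internal vertices x and y (illustrative example).
quartet = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]
leaf_set = ["a", "b", "c", "d"]
all_zero = {leaf: 0 for leaf in leaf_set}
half = {e: 0.5 for e in quartet}
value = modified_log_likelihood(quartet, leaf_set, all_zero, half)
```

With every flip probability equal to 1/2 the four leaves are independent uniform bits, so the all-0 character has probability 1/16 and the sketch returns ln 8.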

Contrary to ML, maximum parsimony (MP) is not based on a statistical model. Denote by $ch(\hat\chi)$
the number of flips in $\hat\chi$, i.e. $ch(\hat\chi) = |\{(u,v) \in E(T) : \hat\chi(u) \neq \hat\chi(v)\}|$.
Let $l(\chi, T)$ be the smallest number of flips in any extension of $\chi$ on $T$, i.e.
$l(\chi, T) = \min_{\hat\chi \in H(\chi)} ch(\hat\chi)$. The parsimony score of $T$ is then
$l(X, T) = \sum_{i=1}^{k} l(\chi_i, T)$. The problem MP consists in finding the tree $T^{**}$ minimizing
$l(X, T)$ over all trees.
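
The parsimony score of a fixed tree can be sketched in the same brute-force style, directly mirroring the definition as a minimum over extensions. The tree encoding and all names below are assumptions of this sketch; on a fixed tree the same quantity can be computed much faster by dynamic programming (Fitch's algorithm), which is not shown here.

```python
import itertools


def changes(edges, ext):
    """Number of edges whose two endpoints carry different states."""
    return sum(1 for (u, v) in edges if ext[u] != ext[v])


def parsimony_length(edges, leaves, chi):
    """Smallest number of flips over all extensions of the character chi."""
    vertices = sorted({v for edge in edges for v in edge})
    internal = [v for v in vertices if v not in leaves]
    best = None
    for states in itertools.product((0, 1), repeat=len(internal)):
        ext = dict(chi)
        ext.update(zip(internal, states))
        c = changes(edges, ext)
        best = c if best is None else min(best, c)
    return best


def parsimony_score(edges, leaves, characters):
    """Sum of the parsimony lengths of all characters on the given tree."""
    return sum(parsimony_length(edges, leaves, chi) for chi in characters)


# Quartet tree ab|cd with internal vertices x and y (illustrative example).
quartet = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]
leaf_set = ["a", "b", "c", "d"]
compatible = {"a": 0, "b": 0, "c": 1, "d": 1}    # one flip on the middle edge
incompatible = {"a": 0, "b": 1, "c": 0, "d": 1}  # needs two flips on this tree
constant = {"a": 0, "b": 0, "c": 0, "d": 0}      # needs no flip
score = parsimony_score(quartet, leaf_set, [compatible, incompatible, constant])
```

Constant characters contribute nothing to the score, which is why adjoining constant sites to the data leaves the parsimony score of every tree unchanged.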

A useful connection between ML and MP was noted by Tuffley and Steel in [17]: if one adds
sufficiently many constant sites (i.e. $\chi(i) = a$, $\forall i \in [n]$, for some $a \in \{0,1\}$) to the
data and applies the maximum likelihood technique, then one necessarily chooses the most parsimonious
tree. This could serve as the basis for a reduction, except that the Tuffley-Steel bounds require an
exponential number of constant sites. Our contribution is to show that a polynomial number of sites
imposes a weaker relationship between likelihood and parsimony, but that this is sufficient for the
following reason. Parsimony is in fact hard to approximate, that is, even the seemingly easier task of
obtaining a solution close to optimal is hard. This result is due to Wareham [18].

We first define the notion of approximation algorithm, then state our main theorem.

Definition 1 Let $\Pi$ be an optimization problem (minimization). Let $I$ denote an instance of $\Pi$ and
$\mathrm{OPT}(I)$, the optimal value of a solution to $I$. For $c > 0$, a $(1 + c)$-approximation algorithm
for $\Pi$ is a polynomial-time algorithm that is guaranteed to return, for any instance $I$, a solution
with objective value $m$ satisfying $m \le (1 + c)\,\mathrm{OPT}(I)$.

Theorem 1 There exists a $c > 0$ sufficiently small so that there is no $(1 + c)$-approximation algorithm
for ML unless P = NP. In particular, ML is NP-hard. (The approximation claim relates to the
modified log-likelihood $\mathcal{L}$.)



3 Proof 

In this section, we prove our main result. The proof follows easily from the following propositions. The
first proposition borrows heavily from [17], although we need somewhat tighter estimates. The second
proposition follows directly from the work of [18, 6].

Proposition 1 Let $c' > c > 0$ be constants. If there is a $(1 + c)$-approximation algorithm for ML then
there is a $(1 + c')$-approximation algorithm for MP.

Proposition 2 ([18, 6]) There exists a $c' > 0$ sufficiently small so that there is no
$(1 + c')$-approximation algorithm for MP unless P = NP.

As in [17], the reduction from MP to ML consists in adjoining a large number of constant sites
to the data. Let $\varepsilon > 0$ be a small constant and $M = \max\{2n, k\}$. Fix $N_c = M^{1/\varepsilon}$.
Denote by $X_0 = \{\chi_i\}_{i=1}^{k+N_c}$ the set $X$ augmented with $N_c$ all-0 characters. For all $\chi$,
let $N_\chi$ be the number of characters in $X$ equal to $\chi$. To avoid the factors of 2 from the
probability of the root state, we define
$\tilde{\mathbb{P}}[X_0 \mid p, T] = 2^{k+N_c}\,\mathbb{P}[X_0 \mid p, T]$ for $p \in [0,1/2]^{E_T}$. Also, let
$f_\chi = 2\,\mathbb{P}[\chi \mid p, T]$. Let $0$ be the all-0 character and $1$, the all-1 character. We make
three claims, from which Proposition 1 follows.
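
The augmentation step of the reduction can be sketched as follows. Characters are encoded as tuples of leaf states, which is an assumption of this sketch, as are the function name and the rounding of N_c up to an integer. For a fixed epsilon the number of added sites is polynomial in M.

```python
import math
from collections import Counter


def augment_with_constant_sites(characters, n, eps):
    """Return (X0, N_c): the data plus N_c all-0 characters, N_c = M^(1/eps).

    characters : list of length-n tuples over {0, 1}, one tuple per character
    n          : number of leaves
    eps        : the small constant epsilon > 0 of the reduction
    Rounding N_c up to an integer is an assumption of this sketch.
    """
    k = len(characters)
    M = max(2 * n, k)
    N_c = math.ceil(M ** (1.0 / eps))
    all_zero = (0,) * n
    X0 = list(characters) + [all_zero] * N_c
    return X0, N_c


# Three characters on four leaves; eps = 1/2 gives M = 8 and N_c = 64.
X = [(0, 0, 1, 1), (0, 1, 0, 1), (0, 0, 0, 0)]
X0, N_c = augment_with_constant_sites(X, n=4, eps=0.5)
counts = Counter(X0)  # multiplicity of each character in the augmented data
```

In this example the all-0 character appears 65 times in the augmented data: once from the original set and 64 times from the added constant sites.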

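Claim 1 Let $\varepsilon > 0$ and $N_c = M^{1/\varepsilon}$ for $M = \max\{2n, k\}$. Let
$p_e = q = \frac{l(X,T)}{E_T (k + N_c)}$, for all $e \in E(T)$. Then
$-\frac{\ln \tilde{\mathbb{P}}[X_0 \mid p, T]}{\ln(k + N_c)} \le (1 + 2\varepsilon)\, l(X, T)$, for $M$ large enough.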
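Proof. Note that, by a calculation identical to [17, Lemma 5],

$$\ln f_\chi = \ln\Big( \sum_{\hat\chi \in H(\chi)} q^{ch(\hat\chi)} (1 - q)^{E_T - ch(\hat\chi)} \Big) \ge \ln\big( q^{l(\chi,T)} (1 - q)^{E_T} \big) \ge l(\chi, T) \ln q - E_T (q + 2q^2),$$

where we have used a standard Taylor expansion (note that we have $q < 1/2$ by definition). This
bound applies in particular to the case $\chi = 0$. Then, as in [17, Lemma 5] again,

$$\begin{aligned}
-\frac{\ln \tilde{\mathbb{P}}[X_0 \mid p, T]}{\ln(k + N_c)}
&= -\frac{1}{\ln(k + N_c)} \ln\Big( f_0^{N_0 + N_c} \prod_{\chi \neq 0} f_\chi^{N_\chi} \Big) \\
&\le \frac{1}{\ln(k + N_c)} \Big( (k + N_c) E_T (q + 2q^2) - l(X, T) \ln q \Big) \\
&= l(X, T) \Big[ \frac{\ln E_T - \ln l(X, T)}{\ln(k + N_c)} + 1 + \frac{1}{\ln(k + N_c)} \Big( 1 + \frac{2\, l(X, T)}{E_T (k + N_c)} \Big) \Big] \\
&\le l(X, T) \Big[ 1 + \frac{\ln M}{\ln M^{1/\varepsilon}} + \frac{1}{\ln M^{1/\varepsilon}} \Big( 1 + \frac{2M}{M^{1/\varepsilon}} \Big) \Big] \\
&\le (1 + 2\varepsilon)\, l(X, T),
\end{aligned}$$

for $M$ large enough. (Note that $\tilde{\mathbb{P}}[X_0 \mid p, T] \le 1$ because $\mathbb{P}[\chi \mid p, T] \le 1/2$ by symmetry.) ■ 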
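
The Taylor step in the proof above presumably rests on the elementary inequality ln(1 - q) >= -q - 2q^2 for q in [0, 1/2], which turns the factor (1 - q)^{E_T} into the term -E_T (q + 2q^2). The short Python sketch below checks this inequality numerically on a grid. The grid resolution and the names are arbitrary choices of this sketch, and a numerical check is of course not a proof.

```python
import math


def taylor_slack(q):
    """Slack in ln(1 - q) >= -q - 2 q^2, that is ln(1 - q) + q + 2 q^2."""
    return math.log(1.0 - q) + q + 2.0 * q * q


# Evaluate the slack on 1001 evenly spaced points of [0, 1/2].
grid = [0.5 * i / 1000 for i in range(1001)]
smallest = min(taylor_slack(q) for q in grid)
```

The slack vanishes at q = 0 and stays nonnegative on the rest of the grid, consistent with the bound used in the proof.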

Claim 2 For all $p \in [0,1/2]^{E_T}$ such that
$-\frac{\ln \tilde{\mathbb{P}}[X_0 \mid p, T]}{\ln(k + N_c)} < l(X, T)$, one has $p_e < \bar{p}$,
$\forall e \in E(T)$, with

$$\bar{p} = \frac{l(X, T) \ln(k + N_c)}{N_c}.$$

Proof. Assume edge $e$ is such that $p_e \ge \bar{p}$. Take any two leaves $u$, $v$ joined by a path going
through $e$. As observed in [17, Formula (11)], the probability that $\chi(u) \neq \chi(v)$ is at least $p_e$.
In particular, the probability that a character is constant is less than $1 - p_e$ and
$-\ln f_0 \ge -\ln(1 - p_e) \ge p_e$ (by the 0-1 symmetry). Therefore,

$$-\frac{\ln \tilde{\mathbb{P}}[X_0 \mid p, T]}{\ln(k + N_c)} = -\frac{1}{\ln(k + N_c)} \ln\Big( f_0^{N_0 + N_c} \prod_{\chi \neq 0} f_\chi^{N_\chi} \Big) \ge -\frac{1}{\ln(k + N_c)} \ln\big( f_0^{N_c} \big) \ge \frac{N_c\, p_e}{\ln(k + N_c)} \ge l(X, T),$$

by $p_e \ge \bar{p}$ and $f_\chi \le 1$ (by the 0-1 symmetry). This contradicts the assumption. ■ 

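Claim 3 Let $\varepsilon > 0$ and $N_c = M^{1/\varepsilon}$ for $M = \max\{2n, k\}$. For all
$p \in [0,1/2]^{E_T}$, we have $-\frac{\ln \tilde{\mathbb{P}}[X_0 \mid p, T]}{\ln(k + N_c)} \ge (1 - 5\varepsilon)\, l(X, T)$
for $M$ large enough.
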
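Proof. For this proof, we need a better estimate than [17, Lemma 6]. From Claim 2 the result holds
whenever $\max_e p_e \ge \bar{p}$. Therefore, we can assume that for all $e \in E(T)$, $p_e < \bar{p}$. Then,

$$f_\chi \le \sum_{\hat\chi \in H(\chi)} \bar{p}^{\,ch(\hat\chi)} \le \sum_{j=0}^{E_T - l(\chi,T)} \binom{E_T}{j + l(\chi,T)} \bar{p}^{\,j + l(\chi,T)} \le \sum_{j=0}^{E_T - l(\chi,T)} (E_T \bar{p})^{j + l(\chi,T)} \le E_T (E_T \bar{p})^{l(\chi,T)},$$

when $M$ is large enough so that $\bar{p} < 1/E_T$. For constant sites, we use the bound $f_0, f_1 \le 1$
(by the 0-1 symmetry). Therefore,

$$\begin{aligned}
-\frac{\ln \tilde{\mathbb{P}}[X_0 \mid p, T]}{\ln(k + N_c)}
&= -\frac{1}{\ln(k + N_c)} \ln\Big( f_0^{N_0 + N_c} \prod_{\chi \neq 0} f_\chi^{N_\chi} \Big) \\
&\ge \frac{l(X, T)}{\ln(k + N_c)} \Big( -\frac{1}{l(X, T)} \sum_{\chi \neq 0, 1} N_\chi \ln E_T - \ln \frac{E_T\, l(X, T) \ln(k + N_c)}{N_c} \Big) \\
&\ge \frac{l(X, T)}{1 + \ln N_c} \Big( -\ln E_T + \ln N_c - \ln\big( E_T^2\, k \ln(k + N_c) \big) \Big) \\
&\ge \frac{l(X, T)}{1 + \ln M^{1/\varepsilon}} \Big( \ln M^{1/\varepsilon} - 4 \ln M - \ln \ln\big( 2 M^{1/\varepsilon} \big) \Big) \\
&\ge l(X, T)(1 - 5\varepsilon),
\end{aligned}$$

for $M$ large enough. ■ 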

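Proof. (Proposition 1) Let $T^*$ be a maximum likelihood tree (for the augmented data $X_0$) with
corresponding edge probabilities $p^*$, and $T^{**}$ be a maximum parsimony tree. Assume we have a
polynomial-time algorithm which is guaranteed to return a tree $T'$ and edge probabilities $p'$ such that

$$-\ln \tilde{\mathbb{P}}[X_0 \mid T', p'] \le (1 + c) \left( -\ln \tilde{\mathbb{P}}[X_0 \mid T^*, p^*] \right).$$

Then the claims above and the optimality of $T^*$ imply that, if $p^{**}$ is chosen as in Claim 1
(for $T = T^{**}$),

$$\begin{aligned}
l(X, T') &\le \frac{1}{1 - 5\varepsilon} \left( \frac{-\ln \tilde{\mathbb{P}}[X_0 \mid T', p']}{\ln(k + N_c)} \right) \\
&\le \frac{1 + c}{1 - 5\varepsilon} \left( \frac{-\ln \tilde{\mathbb{P}}[X_0 \mid T^*, p^*]}{\ln(k + N_c)} \right) \\
&\le \frac{1 + c}{1 - 5\varepsilon} \left( \frac{-\ln \tilde{\mathbb{P}}[X_0 \mid T^{**}, p^{**}]}{\ln(k + N_c)} \right) \\
&\le (1 + c)\, \frac{1 + 2\varepsilon}{1 - 5\varepsilon}\, l(X, T^{**}) \\
&\le (1 + c')\, l(X, T^{**}),
\end{aligned}$$

for $\varepsilon$ small enough. ■ 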
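
How small epsilon must be in the last step can be made explicit: rearranging (1 + c)(1 + 2 eps) <= (1 + c')(1 - 5 eps) gives a linear condition on eps. The Python sketch below computes the resulting threshold. The function names and the numerical values of c and c' are illustrative assumptions, and the sketch ignores the separate requirement that M be large enough.

```python
def largest_epsilon(c, c_prime):
    """Largest eps with (1 + c)(1 + 2 eps) / (1 - 5 eps) <= 1 + c_prime."""
    # Rearranging (1 + c)(1 + 2 eps) <= (1 + c_prime)(1 - 5 eps) gives a linear bound.
    return (c_prime - c) / (2.0 * (1.0 + c) + 5.0 * (1.0 + c_prime))


def blowup(c, eps):
    """Approximation ratio for MP obtained from a (1 + c)-approximation for ML."""
    return (1.0 + c) * (1.0 + 2.0 * eps) / (1.0 - 5.0 * eps)


# Illustrative constants only; the proof needs some c' > c > 0.
c, c_prime = 0.01, 0.02
eps = largest_epsilon(c, c_prime)
ratio = blowup(c, eps)
```

At the threshold the ratio equals 1 + c' up to rounding, and any smaller positive epsilon gives a strictly smaller ratio.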

Proof. (Proposition 2) Wareham [18, Theorem 45 Part 3] gives a reduction from vertex cover with
bounded degree $B$ ($B$-VC) to maximum parsimony. (Wareham actually defines MP as a Steiner tree
problem on the Hamming cube $\{0,1\}^k$ but the correspondence with our definition is straightforward.)
The reduction is such that the existence of a $(1 + c')$-approximation algorithm for maximum parsimony
implies the existence of a $(1 + 2Bc')$-approximation algorithm for $B$-VC. By [6], for a sufficiently
large $B$, there is no 1.16-approximation algorithm for $B$-VC unless P = NP. ■ 